Abstract

TREC-2001 saw the falling into abeyance of the Large Web Task but a strengthening and broadening
of activities based on the 1.69 million page WT10g corpus. There were two tasks. The topic
relevance task was like traditional TREC ad hoc but used queries taken from real web search logs from 
which description and narrative fields of a topic description were inferred by the topic developers. 
There were 50 topics. In the homepage finding task queries corresponded to the name of an entity 
whose home page (site entry page) was included in WT10g. The challenge in this task was to return 
all of the homepages at the very top of the ranking. 

Cursory analysis suggests that once again, exploitation of link information did not help on the 
topic relevance task. By contrast, in the homepage finding task, the best performing run which did 
not make use of either link information or properties of the document’s URL achieved only half of 
the mean reciprocal rank of the best run. 


1 Introduction


The TREC-2001 Web Track activities centred on two tasks: a Topic Relevance Task and a Homepage
Finding Task. Both made use of a 10 gigabyte, 1.69 million document subset of the VLC2, distributed
on five CD-ROMs as the WT10g collection [Bailey et al. 2001].


2 Guidelines


2.1 This Year’s Aims


1. To extend the utility of the WT10g Web test collection by obtaining "sufficiently complete"
relevance judgements for 50 additional (correctly spelled) ad hoc (topic relevance) topics.


2. To explore a different type of retrieval task (homepage finding) for which it is known that link-based
methods can be beneficial.


3. To investigate the benefit (or harm) of correctly implemented link methods on topic relevance.


Participants are welcome to explore specific Web retrieval issues, such as:


1. Can Distributed Information Retrieval techniques be used to improve retrieval effectiveness and/or
efficiency?


2. How well can systems accommodate misspelled queries? Note that the intention is that the
standard query set will be correctly spelled so that we maximise the chance of finding all the relevant
answers. However, if participants are sufficiently interested, we could issue a set of misspelled
variants of the judged queries.


There are obviously many other interesting questions to ask about the Web data. 


2.2 Dataset 


The data for the TREC-2001 Main Web Task is the 10 gigabyte WT10g [CSIRO 2001] collection, distributed
by CSIRO. Note that this is entirely Web data. Documents include the information returned by the 
http daemon (enclosed in DOCHDR tags) as well as the page content. A draft paper [Bailey et al. 2001] 
describing the WT10g collection is available. 


2.3 Web Ad Hoc Task 


TREC-2001 ad hoc topics (topics 501-550) were created by NIST. They are available from the main 
TREC website [National Institute of Standards and Technology 1997]. They take a similar form to 
previous TREC Ad Hoc topics, but the topic title is a real Web query taken from search engine logs and 
the other fields are reverse engineered by NIST assessors. The additional fields are intended to define 
what the searcher wanted (but didn’t fully specify) when they typed their query. 

Systems are officially compared only on the basis of title-only queries, processed completely
automatically. Queries using additional fields have no Web reality! However, despite this, participants were
encouraged to submit additional interactive, manual and full topic statement runs to increase the
discovery rate of relevant documents in the collection. As part of the automated submission process, participants
were required to identify the type of each run.

Official training data (distributed by NIST) consisted of the TREC-9 topics and qrels (topics 451-500). 
These were directly comparable with the TREC-2001 task. 


2.4 Home Page Finding Task 


NIST devised a set of 145 homepage finding queries. The process involved finding a homepage within 
WT10g and then composing a query designed to locate it. This is a known-item search task in which each 
known item is the entry page to a Website. As an example, the query “Text Retrieval Conference” might 
be generated for the http://trec.nist.gov/ homepage. A minimal amount of judging was required 
to determine if the URLs of documents returned by participants were in fact equivalent to the answer 
originally chosen. For example, http://allen.rad.nd.edu:80/ and http://rad.nd.edu/ both refer to 
the home page for the Notre Dame Radiation Laboratory. 

Systems are compared on the basis of the rank of the first correct answer. Measures include mean
reciprocal rank of the first correct answer and success rate (the percentage of cases in which the correct
answer or an equivalent URL occurred in the first N documents).
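The two measures just described can be sketched in a few lines. This is only an illustration under stated assumptions, not the official evaluation software: the function names, the dictionary layout and the query identifiers "Q1" and "Q2" are hypothetical, and the second page in the first ranking is an invented URL. The answer URLs themselves are the examples given above.

```python
from typing import Dict, List, Set


def reciprocal_rank(ranking: List[str], correct: Set[str]) -> float:
    """Return 1/rank of the first correct answer, or 0.0 if none was returned."""
    for rank, url in enumerate(ranking, start=1):
        if url in correct:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Dict[str, List[str]],
                         answers: Dict[str, Set[str]]) -> float:
    """Average the reciprocal rank over every query that has a known answer."""
    total = sum(reciprocal_rank(runs.get(q, []), answers[q]) for q in answers)
    return total / len(answers)


def success_rate(runs: Dict[str, List[str]],
                 answers: Dict[str, Set[str]], n: int) -> float:
    """Fraction of queries with a correct (or equivalent) URL in the first n results."""
    hits = sum(1 for q in answers
               if any(url in answers[q] for url in runs.get(q, [])[:n]))
    return hits / len(answers)


# Hypothetical query ids; equivalent URLs are grouped into one answer set.
answers = {
    "Q1": {"http://trec.nist.gov/"},
    "Q2": {"http://allen.rad.nd.edu:80/", "http://rad.nd.edu/"},
}
runs = {
    "Q1": ["http://trec.nist.gov/some-page.html", "http://trec.nist.gov/"],
    "Q2": ["http://rad.nd.edu/"],
}

mrr = mean_reciprocal_rank(runs, answers)   # (1/2 + 1/1) / 2
top1 = success_rate(runs, answers, 1)
top10 = success_rate(runs, answers, 10)
```

Grouping equivalent URLs into a single answer set per query mirrors the judging step in which returned URLs were checked for equivalence to the answer originally chosen.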

A set of 100 queries and correct answers generated by Nick Craswell using a similar method were 
made available [CSIRO 2001] for training purposes. 

No manual or interactive query modification was permitted in this task. There was a blanket
prohibition on tuning, tweaking or altering of systems based on examining the test queries.


2.5 Indexing Restrictions 
There were none. Participants were permitted to index all of each document or exclude certain fields as 
they wished. 


2.6 Submissions and Judgments 


1. All submissions were due at NIST on or before 2 August 2001. 


2. An automated submission process was used which collected a small amount of information about 
each run. 


3. No. of runs submitted /judged. 
4. All judging was performed by NIST (not CSIRO) assessors. 


5. Judgments in the Web Ad Hoc task (not Homepage Finding) were ternary (nonrelevant, relevant,
highly relevant), as they were last year.


6. Judgments were made on the basis of the text within the document only.


7. Judges were not able to follow links. 


In the Topic Relevance task, 70400 documents were judged and 3363 were judged either relevant 
(2573) or highly relevant (790). 

In the Homepage Finding task, there were a total of 252 right answers over the 145 queries, an average 
of 1.74 right answers per query. However, the distribution of number of right answers per query was very 
skewed. For 132 queries there was only one right answer, but for three queries there were more than 10
right answers: query EP33 (Best Internet) had 25, query EP122 (Society for Technical Communications)
had 22, and query EP139 (The Leader OnLine) had 17.

Best Internet seems to be (or have been) an internet hosting company which controls a large number
of internet domain names and presents all of them with its own homepage (prior to selling them
to customers, I presume). The URLs by which this page was accessible included: www.voici.com,
www.avantisoft.com, www.panint.com, www.samoyed.org, www.cookiefactory.com, www.prost.org,
www.bayberry.com, www.biloxi-ms.com, www.globeprint.com, www.buoymedia.com,
www.nm-solutions.com, www.growing.com, www.caber.com, apogee.best.com, 204.156.149.14,
www.weblab.com, www.anymtnltd.com, www.romenet.com, www.spottedantelope.com, www.straw.com,
www.jjsblues.com, www.jointventure.org, 204.156.144.1, www.mochinet.com, www.flick.com.

By contrast, the multiple results for the Society for Technical Communications seem to include some
spurious answers. The real home page appears to be at www.stc-va.org/display.html but many of the
others judged equivalent are subsidiary pages or homepages of individual chapters or regions of STC.

Finally, the multiple answers for The Leader OnLine correspond to separate issues of an online
publication. Each issue looks like a homepage but each has a specific date, e.g.
www.olympus.net/leader/leaderonlineoctober23961023.htm. The page which you might expect to be a homepage
(www.olympus.net/leader/index.htm1) also has a date.

Considering URL depth to be the number of slashes in the URL after eliminating trailing slashes,
we computed a histogram of the depth of the shallowest right answer for each of the queries. It turns
out that 95 of the 145 shallowest answers are actually at the very top level, e.g. africa.cis.co.za:81,
amelia.experiment.db.erau.edu, dbcl13.cs.ust.hk01. Only 11 of the shallowest right answers are at a depth
greater than 2.
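The depth computation described above can be sketched as follows. This is a minimal illustration rather than the script actually used: the function names and query identifiers are hypothetical, and stripping a leading "http://" scheme before counting slashes is an assumption (the example URLs in the text are written without a scheme).

```python
from collections import Counter
from typing import Dict, List


def url_depth(url: str) -> int:
    """Count slashes in the URL after dropping any scheme and trailing slashes."""
    if "://" in url:
        # Assumption: the scheme separator is not counted towards depth.
        url = url.split("://", 1)[1]
    return url.rstrip("/").count("/")


def shallowest_depth_histogram(right_answers: Dict[str, List[str]]) -> Counter:
    """Histogram of the depth of the shallowest right answer for each query."""
    return Counter(min(url_depth(u) for u in urls)
                   for urls in right_answers.values())


# Hypothetical query ids paired with URLs quoted in the text.
right_answers = {
    "Q1": ["africa.cis.co.za:81"],
    "Q2": ["http://trec.nist.gov/"],
    "Q3": ["www.stc-va.org/display.html"],
}
histogram = shallowest_depth_histogram(right_answers)
```

Under this definition a site entry page such as http://trec.nist.gov/ has depth 0, which corresponds to "the very top level" in the text.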


3 Results 


3.1 Topic Relevance Task 


Table 1 gives details of the 77 official submissions in the title-only, automatic category of the Topic 
Relevance task. The best performing run fub01be2 (FUB) did not make use of links, document structure, 
or URL text. Features listed for that run were: no-stemming, single-word indexing, novel probabilistic 
term weighting model, automatic query expansion. 

The second best run JuruFull (IBM-Haifa) used document structure and referring anchortext.
Features listed for that run were: Vector space model, using lexical affinities, Porter stemming, slight
stop-word filtering.

The best run from the third-ranked group (Ricoh) used only document content. Features listed for
that run were: Probabilistic model, Query expansion, Automatic parameter value estimation.

The best run from the fourth-ranked group (JustSystem) made use of link information but at this stage
it is unclear how. Features listed for that run were: vector space search, reference DB, pseudo-relevance
feedback.

In summary, it was possible to achieve top performance using document content only. Automatic 
query expansion was used by most of the top ranked runs. There was no clear advantage to either 
probabilistic or vector space approaches. 


Table 2 presents early precision results for the same official title-only runs.

Table 3 gives details for the 20 other runs, including two manual runs. The best full-topic automatic 
run performed 27% better than the best title-only run. Interestingly, it made use of URL text as well as 
page content. 
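For readers unfamiliar with the measures used to rank Tables 1 and 2, the sketch below computes precision at k and average precision for a single topic. It assumes the usual non-interpolated definition of average precision and treats both "relevant" and "highly relevant" judgments as relevant; neither choice is spelled out in the text above. Document ids, grades and function names are hypothetical.

```python
from typing import Dict, List, Set


def precision_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the first k retrieved documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Non-interpolated average precision for one topic."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank   # precision at this relevant document
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


# Hypothetical ternary judgments: 0 = nonrelevant, 1 = relevant, 2 = highly relevant.
grades: Dict[str, int] = {"d1": 2, "d2": 0, "d3": 1, "d5": 1}
relevant = {doc for doc, grade in grades.items() if grade >= 1}

ranking = ["d1", "d2", "d3", "d4"]
ap = average_precision(ranking, relevant)    # (1/1 + 2/3) / 3
p_at_2 = precision_at_k(ranking, relevant, 2)
```

Averaging the per-topic values over all 50 topics gives the figure on which runs would be ranked under these assumptions.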


3.2 Home Page Finding Task 


Table 4 gives details of all 43 official runs in the Home Page Finding task. Interestingly, the top 23 runs in
this table all made use of either URL text or links (or both). The best run which did not (IBMHOMENR)
achieved an MRR score only half as high as that of the top ranked run. It made use of document structure.
The highest ranked run which used content only achieved an MRR score only 30% of the best and found
a right answer in the top 10 only half as often.

The performance of the top ranked run (tnout10epCAU) is quite impressive. It found a right answer
in the top 10 in nearly 90% of cases. The features of this run were listed as follows: Unigram language
model, URL text priors (based on depth of URL-path), content run merged with separate anchor-text
run. Interestingly, a companion run which did not use anchor text scored almost as well, reflecting the
importance of URL depth as a feature on this task, at least for this set of queries on this collection.
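To illustrate how a prior based on URL depth can reorder a content ranking, here is a minimal sketch. It is not the method of the tnout10epCAU run, whose details are not given here: the prior values, the example.org URLs, the content scores and the function names are all hypothetical, and adding log-probabilities is simply one common way of combining a content score with a document prior.

```python
import math
from typing import List, Tuple

# Hypothetical prior probability that a page at a given URL depth is a homepage.
DEPTH_PRIOR = {0: 0.60, 1: 0.25, 2: 0.10}
DEFAULT_PRIOR = 0.05   # assumed value for anything deeper than 2


def url_depth(url: str) -> int:
    """Count slashes after dropping any scheme and trailing slashes."""
    if "://" in url:
        url = url.split("://", 1)[1]
    return url.rstrip("/").count("/")


def rerank(candidates: List[Tuple[str, float]]) -> List[str]:
    """Order URLs by content log-score plus the log of the depth prior."""
    scored = [
        (log_score + math.log(DEPTH_PRIOR.get(url_depth(url), DEFAULT_PRIOR)), url)
        for url, log_score in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [url for _, url in scored]


# Hypothetical content log-scores: the deep page matches the query slightly better.
candidates = [
    ("www.example.org/a/b/c.html", -4.0),
    ("www.example.org", -4.5),
]
reranked = rerank(candidates)
```

In this toy case the site root overtakes the deeper page once the prior is added, which is the kind of effect the results above attribute to URL depth.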
Table 1: All official submissions in the title-only, automatic topic relevance task, ranked on average precision.


Run             Group
fub01be2        FUB
JuruFull        IBM-Haifa
JuruFullQE      IBM-Haifa
ricMM           ricoh
ricAP           ricoh
ricMS           ricoh
JuruPruned01    IBM-Haifa
JuruPrune005    IBM-Haifa
jscbtawtl4      Justsystem
jscbtawtl3      Justsystem
Lemur           cmu-lti
fub01ne2        FUB
jscbtawtl2      Justsystem
ok10wt3         microsoft
hum01tlx        hummingbird
ricST           ricoh
msrcn1          microsoft-china
ok10wt1         microsoft
fub01idf        FUB
tnout10t2       tno/utwente
iit01tfe        IIT
jscbtawtl1      Justsystem
msrcn4          microsoft-china
msrcn2          microsoft-china
fub01ne         FUB
hum01tl         hummingbird
msrcn3          microsoft-china
posnir01rpt     postech
pir1Wt2         cuny
flabxt          Fujitsu
UniNEtdL        Neuchatel
flabxtl         Fujitsu
UniNEt7dL       Neuchatel
fdut10wtc01     Fudan
pir1Wt1         cuny
UniNEtd         Neuchatel
tnout10t1       tno/utwente
hum01t          hummingbird
apl10we         apl-jhu
fdut10wtl01     Fudan
posnir01st      postech
posnir01pt      postech
iit01t          IIT
ARCJ0           ibm-web
ARCJ5           ibm-web
Merxt           IRIT
uwmtaw2         waterloo
uwmtaw1         waterloo
PDWTAHDR        padova
Ntvenx2         nextrieve
yeahtb01        Yonsei
yeaht01         Yonsei
Ntvenx1         nextrieve
PDWTAHWL        padova
Ntvfnx3         nextrieve
ajouai0103      ajou
ajouai0101      ajou
csiro0awa1      CSIRO
uncvsms         uncYang
Ntvfnx4         nextrieve
uwmtaw0         waterloo
csiro0awa3      CSIRO
icadhoc3        imperial
ictweb10n       chinese_academy
ictweb10nl      chinese_academy
PDWTAHPR        padova
apl10wa         apl-jhu
csiro0awa2      CSIRO
apl10wb         apl-jhu
uncfsls         uncYang
PDWTAHTL        padova
icadhoc1        imperial
ictweb10nf      chinese_academy
ictweb10nfl     chinese_academy
icadhoc2        imperial
irtLnut         uncNewby
irtLnua         uncNewby


Table 2: All official submissions in the title-only, automatic topic relevance task, ranked on precision at 10 
documents retrieved. 


Run             Group              P@5      P@10     P@20
JuruFull        IBM-Haifa          0.4320   0.3620   0.3130
JuruPrune005    IBM-Haifa
JuruPruned01    IBM-Haifa
JuruFullQE      IBM-Haifa
fub01be2        FUB
ricMM           ricoh
ricAP           ricoh
flabxt          Fujitsu
flabxtl         Fujitsu
fub01idf        FUB
ok10wt1         microsoft
ok10wt3         microsoft
ricMS           ricoh
tnout10t2       tno/utwente
hum01tlx        hummingbird
fub01ne2        FUB
ricST           ricoh
fub01ne         FUB
yeaht01         Yonsei
yeahtb01        Yonsei
hum01tl         hummingbird
Lemur           cmu-lti
msrcn2          microsoft-china
msrcn4          microsoft-china
jscbtawtl4      Justsystem
jscbtawtl3      Justsystem
fdut10wtl01     Fudan
hum01t          hummingbird
msrcn1          microsoft-china
msrcn3          microsoft-china
fdut10wtc01     Fudan
iit01t          IIT
jscbtawtl2      Justsystem
jscbtawtl1      Justsystem
Merxt           IRIT
posnir01rpt     postech
iit01tfe        IIT
tnout10t1       tno/utwente
UniNEtd         Neuchatel
PDWTAHDR        padova
UniNEtdL        Neuchatel
ARCJ0           ibm-web
ARCJ5           ibm-web
UniNEt7dL       Neuchatel
uwmtaw1         waterloo
Ntvenx1         nextrieve
PDWTAHWL        padova
posnir01st      postech
Ntvenx2         nextrieve
posnir01pt      postech
csiro0awa1      CSIRO
uncvsms         uncYang
apl10we         apl-jhu
csiro0awa3      CSIRO
pir1Wt1         cuny
uwmtaw2         waterloo
Ntvfnx4         nextrieve
pir1Wt2         cuny
Ntvfnx3         nextrieve
csiro0awa2      CSIRO
ajouai0101      ajou
ajouai0103      ajou
apl10wb         apl-jhu
uwmtaw0         waterloo
icadhoc3        imperial
ictweb10n       chinese_academy
PDWTAHPR        padova
ictweb10nl      chinese_academy
apl10wa         apl-jhu
icadhoc2        imperial
icadhoc1        imperial
PDWTAHTL        padova
uncfsls         uncYang
ictweb10nf      chinese_academy
ictweb10nfl     chinese_academy
irtLnut         uncNewby
irtLnua         uncNewby

Table 3: All other (manual and long automatic) official submissions in the topic relevance task. Manual runs are 
marked with an asterisk. 


Run             Group
iit01m*         IIT
ok10wtnd1       microsoft
csiro0mwa1*     CSIRO
ok10wtnd0       microsoft
flabxtd         Fujitsu
UniNEn7d        Neuchatel
hum01tdlx       hummingbird
kuadhoc2001     kasetsart
apl10wd         apl-jhu
posnir01ptd     postech
flabxtdn        Fujitsu
iit01tde        IIT
Merxtd          IRIT
pir1Wa          cuny
fdut10wac01     Fudan
uncvsmm         uncYang
fdut10wal01     Fudan
yeahdb01        Yonsei
yeahtd01        Yonsei
uncfslm         uncYang











Table 4: All official submissions in the homepage finding task. MRR is the mean reciprocal rank of the first 
correct answer. %top10 is the proportion of queries for which a right answer was found in the top 10 results. 
%fail is the proportion of queries in which no right answer was found in the top 100 results. 

Run             Group
tnout10epCAU    tno/utwente
tnout10epCU     tno/utwente
jscbtawep2      Justsystem
jscbtawep1      Justsystem
jscbtawep4      Justsystem
jscbtawep3      Justsystem
yehp01          Yonsei
yehpb01         Yonsei
UniNEep1        Neuchatel
UniNEep2        Neuchatel
IBMHOMER
flabxeall       Fujitsu
csiro0awh2      CSIRO
iit01stb        IIT
iit01st         IIT
UniNEep3        Neuchatel
VTEP            VT
msrcnp2         microsoft-china
csiro0awh1      CSIRO
UniNEep4        Neuchatel
msrcnp1         microsoft-china
flabxe75a       Fujitsu
ok10wahd1       microsoft
IBMHOMENR
flabxemerge     Fujitsu
flabxet256      Fujitsu
ok10wahd0       microsoft
ok10whd1        microsoft
tnout10epC      tno/utwente
tnout10epA      tno/utwente
ok10whd0        microsoft
apl10ha         apl-jhu
ichp2           imperial
apl10hb         apl-jhu
PDWTEPDR        padova
PDWTEPWL        padova
VTBASE          VT
PDWTEPTL        padova
PDWTEPPR        padova
PDWTEPDR        padova





